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Abstract 

Logistic regression with unknown sizes has many important applications in biological and medical sciences. All 
models about this problem in the literature are parametric ones. A semiparametric regression model is proposed. 
This model incorporates overdispersion due to the variation of sizes, and allows general dose-response relations. An 
Expectation Conditional Maximization algorithm is provided to maximize the log likelihood. The bootstrap method 
can be used to construct confidence intervals for regression coefficients. Simulation is performed to study the behavior 
of the proposed model. Two real examples are investigated by the proposed model. 
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1 Introduction 

Consider that there are r observations {yi, xi), i = 1, 2, . . . , r, where yi is a binomial random variable with size rii 
and probability pi and Xi is a vector of covariates of length g. The issue of interest is to investigate how the covariates 
Xi affect the probabilities pi. A logistic regression problem arises when the sizes rii are known (e.g., McCullagh and 
Nelder 1999). It can happen that the sizes rii are unknown. 

The author was motivated to study the logistic regression problem with unknown sizes by colony formation assays. 
These assays are used to assess the cytotoxic effects of chemical or physical agents on proliferating cells. In these 
experiments, cells are exposed to the agent of interest, and then placed onto culture plates for colony formation. After 
some time, visible colonies on each plate are counted to decide how many cells survive. The initial number of cells put 
onto each plate is usually unknown. Table ^presents an example in which the survival of M. bovis cells was studied 
(Trajstman 1989). Note that yi is the number of colonies, and Ui is the unknown total number of cells on a culture 
plate. 

There are many other applications. For example, Margolin et al. (1981) studied the effects of quinoline on the 
number of revertant colonies of Salmonella strain TA98. Bailer and Piegorsch (2000) reviewed the statistical methods 
on aquatic toxicology studies and took the effect of nitrofen on the offspring of C. dubia as an example. Morton (1981) 
presented an example of wheat disinfestation by hot air. Elder (1996) investigated the survival of V79-473 cells and 
their exposure times to high temperature. The radiation damage on jejunal crypts has been studied extensively (e.g.. 
Khan et al. 1997, Kinashi et al. 1997, Mason et al. 1999, SaHn et al. 2001, and Goel et al. 2003). 

In the literature, the response yi is usually assumed to be a Poisson random variable, such as Wadley (1949) and 
Margolin et al. (1981). Such an approximation is inappropriate when ri; and pi are moderate in size (e.g.. Elder et 
al. 1999). Anscombe (1949) considered overdispersion relative to the Poisson distribution and developed a model 
based on the negative-binomial distribution. Baker et al. (1980) treated yi as a Poisson random variable. The yi 
in the control group have a common mean m and those in the treatment group rapi, where a probit dose-response 
relation is assumed. Trajstman (1989) modified the method of Baker et al. (1980) to allow a logistic dose-response 
relation and incorporated overdispersion by assuming a scaled Poisson variance-mean relationship. Morgan and Smith 
(1992) also based their work on Baker et al. (1980), and used a negative-binomial variance/mean relationship with 
a heterogeneity factor to handle extra Poisson variation. Kim and Taylor (1994) and Elder et al. (1999) developed 
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Table 1: The M. bovis cell survival data. 
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a quasi-likelihood approach by regarding yi\ni as a binomial random variable. Kim and Taylor (1994) assumed that 
E{ni) = \i and var {rii) = XiV with Ai known and v ^ 1 unknown. Elder et al. (1999) estimated A = E{ni) with 
var (ni) = A(l + J^A) and ly ^ 0. All previous methods used parametric models. 

We propose a semiparametric regression model, in which each m is assumed to be a Poisson random variable with 
mean Ai, and the \i are assumed to arise as a random sample from an unspecified mixing distribution. By doing this, 
a rich pool of distributions can be used for A,. 

In Section 2, a semiparametric model is formulated, and an Expectation Conditional Maximization (ECM) algo- 
rithm that maximizes the log likelihood is described. The issues of selecting the number of support points and using 
the bootstrap method are also discussed. Simulation results are shown in Section 3. Section 4 apphes the proposed 
model to two real examples. One is from an M. bovis cell survival assay, and the other from a jejunal crypt stem cell 
survival assay. 

2 Methods 

2.1 A semiparametric model 

The probability pi can be written as pi = h{xi] f3), where h is the inverse of a link function, e.g., = logit or 
probit. Note that his a general function of Xi and (3. The unknown size rii is assumed to be a Poisson random variable 
with mean Aj. It is clear that yi given Aj is a Poisson random variable with mean Xih{xi; (3). The nuisance parameters 
Xi are further assumed to follow a mixing distribution G. Because the parameter of interest /9 is in the £i-dimensional 
Euclidean space, a semiparametric regression model iirises when G is treated nonparametrically. The density of a 
single generic observation (y, x) is 

/(y;x,/3,G)= / /(y;x,/3,A)rfG(A), 
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where f{y; x, f3, A) is a Poisson density with mean \h{x; (3), i.e.. 



/(y; X, /3, A) = exp{-A/i(a;; /3)}{A/i(a;; (3)Y ly\, y = 0, 1, . . . . 



The log hkehhood can be written as 



r 



£{f3,G) ^Y.^og f{y,;x„f3,G). 



(1) 



2.2 An ECM algorithm 

In order to maximize £{/3, G) in ([0, first we will consider the case that G is a discrete distribution with a fixed number 
of support points. Let G — J2j=i '^j^i^j)' where '^j — 1' '^j > 0, (5 is the indicator function, and Aj e (0, oo). 

Let a. ~ {ai, a2, ■ ■ ■ , uk)' , ^ = (Ai, A2, . . . , A^)' and 6 =(/3, a. A). The log likelihood l{(3, G) in Q can be 



One may consider using an EM algorithm to maximize £{6) in (|2j. However, the M-step in the EM algorithm may be 
computationally unreliable. 

We will consider an ECM algorithm (Meng and Rubin 1993; McLachlan and Peel 2000, pl48). The ECM al- 
gorithm simplifies the M-step by replacing the complicated M-step with three computationally simpler and stabler 
conditional maximization (CM) steps. It also drives up the log likelihood at each iteration (Meng and Rubin 1993). 

Suppose the missing datum is 2; = (zi, 22, zk)' , the indicator vector for the pair (cc, y), where Zj = 1 for some 
i and Zk = for all k ^ j, i.e., A = Aj, j = 1,2, . . . , K. Note that z is multinomial distributed with size one and 
probability ex. The complete density for a single datum {x, z, y) is ^f=\ V^jfj iui /3, ■ The joint complete log 



written as 




(2) 



hkelihood is 



r 



K 



The expected conditional log likelihood to be maximized is 

H^(0;0(°)) = i?e(o, {£c{0)\yi,y2,-,yr}. 



The E-step involves getting the conditional expectation of Zij, i.e.. 




'e(o){ztj\yi,y2,--,yr) 



af/,fa;x.,/3W,Af) 



for i = 1, 2, . . . , r and j = 1, 2, . . . , if . 
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In the CM-step, we need to maximize the expected conditional complete log likelihood 

w{e- = j^Y. i°g"^- + E E log /^(y- ^'"3' 

2—1 j—1 i—1 j—1 

r K 

= constant + ^ ^ tt^-^'' log 



Ti(a) 



if 



+ EE'^«i^ {y^logAj +y,log/i(a;j;/3) - \jh{x.,; (3)] 
over a, A, /3 sequentially. The maximum likelihood estimator (MLE) for a. is 



(1) 
a )j ^ r 



(3) 



The conditional MLE for A given /3 = /3^*^^ is 



A 



(1) 



(0) 



The conditional MLE for (3 given A = A^^^ is 



13^^^ = argmaxT2(;3, A^^)). 



(4) 



(5) 



Since there is no analytic solution for /J^^-* in the optimization problem defined in (|5}, a Newton Raphson algorithm is 
applied. The first order derivative of T2{f3, A^^-*) is 



r K 



-(0) 



and the second order derivative is 
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The Newton Raphson algorithm is defined by, with /3/Q^ =f3^'^\ 
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(6) 



2.3 Selecting the number of support points 

By increasing the number of support points of G, the maximized log likelihood l{6) can be increased. One may 
consider using the global maximizer by trying different values of K. In order to obtain a reasonable and parsimonious 
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Table 2: Simulation results: sd stands for standard deviation, qi for 95% quantile interval, and mse for mean square 
error. 
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(0.25,0.75) 


(100,300) 


0.003 


0.025 


(0.954, 1.049) 


0.001 


3 


1 


(0.5,0.5) 


(450,650) 
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fit to the data, we propose to choose the number of support points by minimizing the BIC (e.g., Wang et al. 1996), i.e., 

k= argmin {-2^(0} + log(r)(2if - 1 + p)}. 
Ke{i,2,--- } 

2.4 The bootstrap method 

The bootstrap method can be applied to obtain confidence intervals for the regression coefficients /3. For a random 
design, the nonparametric bootstrap method can be applied, in which one can sample the pairs (y;, Xi). For a fixed 
design, we propose to use a parametric bootstrap method. A resample of size r is generated as follows, 

Vi ~ /(y; Xi, f3, Xi), i = 1, 2, . . . , r, 

where is a random variable drawn from the estimated mixing distribution G, 

K 

3 Simulation 

We report a simulation study in which there is a single covariate x. There are 10 replications for each integer x in 
[—5, 5], so that r = 110. A logistic dose-response relation is assumed, i.e., 

log \ , ^' > ^ (3o+ Pi x^. 
1 1 - Pi J 

The intercept (3q is fixed to be one. A 2^ design is considered, i.e.. 




Pi (01,0:2) (Al,A2) 



For each setting, 800 samples are generated. The results are shown in Table |2l One can observe that the bias, 
standard deviation and mean square error of the slope /3i are quite small. The /3i falls into the 95% quantile interval, 
with ends being 2.5% and 97.5% quantiles. 
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Table 3: The estimates of the mixing distribution and the BIC for the M. bovis data. 
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4 Example 

4.1 An M. bovis cell survival assay 

The data in Tableware part of Table 1 in Trajstman (1989) and also studied by Morgan and Smith (1992). M. bovis 
cells were treated with one of the decontaminants, HPC or oxalic acid with one concentration, then placed on the 
culture plates for colony formation. After 12 weeks (at stationarity), the M. bovis colonies were counted. Trajstman 
(1989) and Morgan and Smith (1992) treated the count of three colonies for HPC dose at 0.00075 as an extreme 
observation and omitted it from all analysis. However, such a small count can be automatically taken care of in the 
proposed semiparametric model. 

An ANOVA model is fitted with a separate factor for each level of the decontaminants. Let Xj denote a factor for 
the concentration level j of the decontaminants. It is assumed that the pi satisfy that 

11 

log{-^} =/3o + ^/3j:e,,, i = 1,2,..., 129, (7) 

j=i 

where (3q is the control effect and /3j is the effect difference between dose j and the control dose, j — 1,2, ... ,11. 

The results of estimated mixing distributions are in Table |3] The smallest BIC corresponds to K = 3. When 
K ^ 3, the estimate G is written as 

G = 0.0465(9.391) + 0.840(5(69.52) + 0.115 5(107.1). 

Table|3presents the results for the regression coefficients. In the bootstrap, 200 resamples are drawn. The bootstrap 
standard errors of the regression coefficients are small. Since all 95% confidence intervals except those of /3o and /3g 
do not include 0, all treatment doses except HPC 0.0075 have more negative effects on survival of M. Bovis cells than 
the control. The MLEs /3g and (3g violate the dose-response monotonicity relationship, i.e., increased negative effects 
on the response associated with increasing dosage of the decontaminants. This is consistent with the monotonicity 
violation in their sample means in Table Q More investigation is needed for the data. The estimates f3j are not 
comparable with those in Trajstman (1989) and Morgan and Smith (1992), which used a simple linear model in 0. 
Figure [l]presents the responses y and their fitted values, which shows that the model fits very well. 
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Table 4: The estimated regression coefficients, bootstrapped standard error, and 95% confidence interval for the M. 
Bovis data. 
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Figure 1: The response y, sample mean y and fitted value y. 
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Table 5: Jejunal crypt data results from the proposed and previous approaches (logistic regression and Kim's method 
fix Hi and E{ni) at 160, respectively; Kim's and Elder's quasi-likelihood method of moments estimates come from 
Elder etal. (1999)). 





estimate (standard error) 


logistic 


Kim's Elder's 
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/So 


7.432 (0.175) 


7.410(0.191) 6.727(0.725) 


6.705 (0.746) 
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-1.185 (0.024) 


-1.183(0.026) -1.126(0.061) 


- 1.124(0.059) 


A 




_ 194.7 (43.4) 


196.1 



4.2 A jejunal crypt stem cell survival assay 

Table 1 in Elder et al. (1999) presents a surviving jejunal crypt data set from an experiment done on 126 mice. Note 
that the colony count of 12 for dose 9.25 is redundant and should be removed. Kim and Taylor (1994) also investigated 
the data set. A jejunal crypt is a compartment containing stem cells in a certain region of the intestine. These cells 
are responsible for maintaining the function of the intestine. In such an experiment, mice are treated by a certain dose 
of gamma rays, and then killed to count the number of surviving crypts. Because the experiment needs live mice, the 
total number of crypts in each mouse is unknown. It is assumed that the surviving probabihties pi satisfy that 

logj^^l =/3o + /?ix„ i = 1,2,..., 126, 

where Xi is the gamma dose. 

The BIC are 724.3 for X = 1 and 734.0 for K = 2. With K = \, the estimated G is degenerated at A = 196.1. We 
draw 200 bootstrap resamples. Table|5]compares the estimates of the proposed method with the previous methods. All 
the estimates of previous methods fall into our 95% confidence intervals: (5.089, 8.023) for /3o and (—1.241, —1.009) 
for (3i. The standard errors of the regression coefficients are quite small. Because no confidence intervals include 0, 
the regression coefficients are significant at the significance level of 0.05. 

5 Discussion 

We propose a flexible semiparametric model for the logistic regression problem with unknown sizes, in which the 
regression coefficients can be estimated together with the nuisance parameter, the mixing distribution. 

The parameter estimates in the proposed model can be obtained effectively by an ECM algorithm. When one runs 
the ECM algorithm, good initial values will help find the MLEs quickly. One may run a Poisson regression analysis 
to find the initial values of (3. 

References 

[Anscombe, 1949] Anscombe, F. J. (1949). Note on a problem in probit analysis. Annals of Applied Biology , 36:203- 
205. 

[Bailer and Piegorsch, 2000] Bailer, A. J. and Piegorsch, W. W. (2000). From quantal counts to mechanisms and 
systems: the past, present, and future of biometrics in environmental toxicology. Biometrics, 56:327-336. 

[Baker et al., 1980] Baker, R. J., Pierce, C. B., and Pierce, J. M. (1980). Wadley's problem with controls. GLIM 
Newsletter, 3:32-35. 



8 



[Elder, 1996] Elder, J. A. (1996). Development of quasi-likelihood techniques for the analysis of pseudo-proportional 
data. Unpublished doctoral dissertation, Virginia Commonwealth University, Medical College of Virginia, Depart- 
ment of Biostatistics. 

[Elder et al., 1999] Elder, J. A., Carter, W. H., Gennings, C, and Elswick, R. K. (1999). A quasi-UkeUhood approach 
for overdispersed binomial data when N is unobserved. Journal of Agricultural, Biological, and Environmental 
Statistics, 4:102-115. 

[Goel et al., 2003] Goel, H. C, Salin, C. A., and Prakash, H. (2003). Protection of jejunal crypts by rh-3 (a preparation 
of hippophae rhamnoides) against lethal whole body gamma irradiation. Phytotherapy Research, 17:222-226. 

[Khan et al., 1997] Khan, W. B., Shui, C. X., Ning, S. C, and Knox, S. J. (1997). Enhancement of murine intestinal 
stem cell survival after irradiation by keratinocyte growth factor. Radiation Research, 148(3):248-253. 

[Kim and Taylor, 1994] Kim, D. K. and Taylor, J. M. G. (1994). Transform-both-sides approach for overdispersed 
binomial data when N is unobserved. Journal of the American Statistical Association, 89(427):833-845. 

[Kinashi et al., 1997] Kinashi, Y., Ono, K., and Abe, M. (1997). The micronucleus assay of lymphocytes is a useful 
predictive assay of the radiosensitivity of normal tissue: a study of three inbred strains of mice. Radiation Research, 

148(4):341-347. 

[Margolin et al., 1981] Margohn, B. H., Kaplan, N., and Zeiger, E. (1981). Statistical analysis of the Ames 
salmonella/microsome test. Proceedings of the National Academy of Sciences, 78:3779-3783. 

[Mason et al, 1999] Mason, K. A., Kishi, K., Hunter, N., Buchmiller, L., Akimoto, T., Komaki, R., and Milas, L. 
(1999). Effect of docetaxel on the therapeutic ratio of fractionated radiotherapy in vivo. Clinical Cancer Research, 
5:4191^198. 

[McCullagh and Nelder, 1989] McCuUagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and 
Hall, London, 2 edition. 

[McLachlan and Peel, 2000] McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley. 

[Meng and Rubin, 1993] Meng, X. L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algo- 
rithm: a general framework. Biometrika, 80:267-278. 

[Morgan and Smith, 1992] Morgan, B. J. T. and Smith, D. M. (1992). A note on Wadley's problem with overdisper- 
sion. Applied Statistics, 41:349-354. 

[Morton, 1981] Morton, R. (1981). GeneraUzed spearman estimators of relative dose. Biometrics, 37:223-233. 

[Salin et al., 2001] Salin, C. A., Samanta, N., and Goel, H. C. (2001). Protection of mouse jejunum against lethal 
irradiation by podophyllum hexandrum. Phytomedicine, 8(6):413^22. 

[Trajstman, 1989] Trajstman, A. C. (1989). Indices for comparing decontaminants when data come from dose- 
response survival and contamination experiments. Applied Statistics, 38:481^94. 

[Wadley, 1949] Wadley, F. M. (1949). Dosage-mortahty correlation with number treated estimated from a parallel 
sample. Annals of Applied Biology, 36:196-202. 

[Wang et al., 1996] Wang, P., Puterman, M. L., Cockburn, 1., and Le, N. D. (1996). Mixed poisson regression models 
with covariate dependent rates. Biometrics, 52:381^00. 



9 



